Abstract

In this work we investigate the relationship between Bregman distances
and the regularized logistic regression model. We present a detailed
study of Bregman distance minimization, where Bregman distances are a
family of generalized entropy measures associated with convex functions.
We cast L1-regularized logistic regression in this more general
framework and then apply non-linear constrained optimization
techniques, in the form of a primal-dual algorithm, to estimate the
parameters of the logistic model.



1 Introduction 

We study the problem of regularized logistic regression as proposed by
[5] and [12]. L1 regularization has been studied extensively in recent
years because of the sparsity of the classifiers it produces [11]. The
objective function of the L1-regularized logistic regression problem
(LRP) (Eqn. 4) is convex but not differentiable (specifically, when any
of the weights is zero), so solving it is more of a computational
challenge than solving the L2-regularized LRP. Despite this additional
challenge, interest in L1-regularized logistic regression (LR) has been
growing. The main motivation is that L1-regularized LR typically yields
a sparse vector $\lambda$, i.e., $\lambda$ typically has relatively few
nonzero coefficients. (In contrast, L2-regularized LR typically yields
$\lambda$ with all coefficients nonzero.) When $\lambda_j = 0$, the
associated logistic model does not use the $j$th component of the
feature vector, so a sparse $\lambda$ corresponds to a logistic model
that uses only a few of the features. Indeed, we can think of a sparse
$\lambda$ as a selection of the relevant or important features (i.e.,
those associated with nonzero $\lambda_j$), as well as the choice of the
intercept value and weights (for the selected features). A logistic
model with sparse $\lambda$ is, in a sense, simpler or more parsimonious
than one with non-sparse $\lambda$. It is not surprising that
L1-regularized LR can outperform L2-regularized LR, especially when the
number of observations is smaller than the number of features.

Our work is based directly on the general setting of [11], in which one
attempts to solve optimization problems based on general Bregman
distances. They proposed the iterative scaling algorithm for minimizing
such divergences through the use of auxiliary functions. Our work builds
on several previous works that have related divergence-based approaches
to logistic regression. We closely follow the work of [5], who propose a
new category of parallel and sequential algorithms for boosting and
logistic regression based on Bregman distance minimization. They are
among the first to connect the fields of regression and generalized
divergences. However, unconstrained estimation of the logistic
parameters is unreliable for large problems, and hence we take up this
study to tie constrained optimization to the existing work.

Most of the work connecting Bregman distances and logistic regression
minimizes the unconstrained auxiliary function at each step. In this
work we pose the problem with box or L1 constraints, motivated by the
favorable properties of L1 regularization when the dimension is large
but the number of training data points is relatively small.

2 Logistic Regression 

Let $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ be a set of training
examples where each instance $x_i$ belongs to a domain or instance space
$\mathcal{X}$ and each label $y_i \in \{-1, +1\}$.

We assume that we are given a set of real-valued functions on
$\mathcal{X}$, denoted by $h_j$ where $j \in \{1, 2, \ldots, n\}$.
Following convention in the Maximum-Entropy literature, we call these
functions features; in the boosting literature, these would be called
weak or base hypotheses. Note that, in the terminology of the latter
literature, these features correspond to the entire space of base
hypotheses rather than merely the base hypotheses that were previously
found by the weak learner. We study the problem of approximating the
$y_i$'s using a linear combination of features. That is, we are
interested in the problem of finding a vector of parameters
$\lambda \in \mathbb{R}^n$ such that
$f_\lambda(x_i) = \sum_{j=1}^{n} \lambda_j h_j(x_i)$ is a good
approximation of $y_i$.

For classification problems, it is natural to try to match the sign of
$f_\lambda(x_i)$ to $y_i$, that is, to attempt to minimize

$$\sum_{i=1}^{m} I\{ y_i f_\lambda(x_i) \le 0 \} \qquad (1)$$

where $I\{c\} = 1$ whenever $c$ is true and $0$ otherwise. Minimizing
this loss is intractable in its most general form, so some other
non-negative loss function that closely resembles it is minimized
instead.



In the logistic regression framework we use the estimate

$$\Pr[y = +1 \mid x] = \frac{1}{1 + \exp(-f_\lambda(x))} \qquad (2)$$

and the log-loss for this model is defined as

$$\mathcal{L}(x, y) = \sum_{i=1}^{m} \ln\bigl(1 + \exp(-y_i f_\lambda(x_i))\bigr) \qquad (3)$$

This is the loss function for the unconstrained minimization problem. As
pointed out earlier, regularized loss functions are effective in most
practical cases, and hence we pose the optimization problem with a
regularized loss function, which can be written as

$$\mathcal{L}(x, y) = \sum_{i=1}^{m} \ln\bigl(1 + \exp(-y_i f_\lambda(x_i))\bigr) + R(\lambda) \qquad (4)$$

where $R(\lambda)$ is the regularization function, which takes different
forms depending on the regularization method. For L1 regularization,
$R(\lambda) = \alpha \|\lambda\|_1$.
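As a concrete reading of Eqn. 4, the following minimal sketch evaluates
the L1-regularized log-loss in plain Python. The function names, the toy
feature matrix `H` (with `H[i][j]` standing for $h_j(x_i)$) and the value
of `alpha` are illustrative assumptions, not part of the original work.

```python
import math

def f_lambda(lam, h_x):
    """Linear score f_lambda(x) = sum_j lam[j] * h_j(x); h_x lists the h_j(x)."""
    return sum(l * h for l, h in zip(lam, h_x))

def regularized_log_loss(lam, H, y, alpha):
    """Eqn. 4 with R(lam) = alpha * ||lam||_1.

    H[i][j] holds h_j(x_i) and y[i] is a label in {-1, +1}.
    """
    data_term = sum(math.log1p(math.exp(-y_i * f_lambda(lam, h_i)))
                    for h_i, y_i in zip(H, y))
    return data_term + alpha * sum(abs(l) for l in lam)

# Hypothetical toy data: 3 examples, 2 features.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [+1, -1, +1]
loss_at_zero = regularized_log_loss([0.0, 0.0], H, y, alpha=0.1)
loss_at_lam = regularized_log_loss([1.0, -1.0], H, y, alpha=0.1)
```

At $\lambda = 0$ every example contributes $\ln 2$ and the penalty
vanishes, which gives a quick sanity check on any implementation.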

3 Bregman Distance 

Let $F : \Delta \to \mathbb{R}$ be a continuously differentiable and
strictly convex function defined on a closed convex set
$\Delta \subseteq \mathbb{R}^r_+$. The Bregman distance associated with
$F$ is defined for $p, q \in \Delta$ to be

$$B_F(p \,\|\, q) = F(p) - F(q) - \nabla F(q) \cdot (p - q) \qquad (5)$$

For instance, when

$$F(p) = \sum_{i=1}^{r} p_i \ln p_i$$

$B_F$ is the unnormalized relative entropy, denoted $D_U$:

$$D_U(p \,\|\, q) = \sum_{i=1}^{r} \Bigl( p_i \ln\frac{p_i}{q_i} + q_i - p_i \Bigr)$$

A graphical representation of the Bregman distance as a measure of
convexity is shown in Fig. 1.

Figure 1: The Bregman distance
$B_f(p \,\|\, q) = f(p) - f(q) - f'(q)(p - q)$ is an indication of the
increase in $f(p)$ over $f(q)$ above linear growth with slope $f'(q)$.

The distances $B_F$ were introduced by Bregman [4] along with an
iterative algorithm for minimizing $B_F$ subject to linear constraints.
Bregman distances have been used by numerous authors to pose problems in
terms of generalized divergences: [7] used such divergences for
generalized non-negative matrix approximations, and [1] used them for
clustering applications. Other divergence minimization approaches have
been tried for data mining and information retrieval. Posing density
estimation problems as KL divergence minimization problems has long been
studied. It can be shown that the KL divergence is a special case of a
Bregman divergence, and the broad success of such methods warrants a
closer investigation of Bregman divergences themselves.

To develop the rest of this work we need a few definitions. Let
$\Delta \subseteq \mathbb{R}^r$ and let $F : \Delta \to \mathbb{R}$ be a
real-valued function. We assume that $\Delta$ is a closed convex set,
and that $F$ is strictly convex and continuously differentiable on the
interior of $\Delta$.

Definition 1 For $v \in \mathbb{R}^r$ and $q \in \Delta$ the Legendre
transform $\mathcal{L}_F(v, q)$ is defined as

$$\mathcal{L}_F(v, q) = \arg\min_{p \in \Delta} \; B_F(p \,\|\, q) + v \cdot p$$

Lemma 1 The mapping $(v, q) \mapsto \mathcal{L}_F(v, q)$ defines a
smooth action of $\mathbb{R}^r$ on $\Delta$ by

$$\mathcal{L}_F(v, \mathcal{L}_F(w, q)) = \mathcal{L}_F(v + w, q).$$
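The definition in Eqn. 5 can be checked numerically. The sketch below
(function names and test vectors are illustrative assumptions) computes
the generic Bregman distance from $F$ and its gradient, and compares it
with the closed form of the unnormalized relative entropy obtained for
$F(p) = \sum_i p_i \ln p_i$.

```python
import math

def bregman(F, grad_F, p, q):
    """B_F(p || q) = F(p) - F(q) - <grad F(q), p - q>, as in Eqn. 5."""
    inner = sum(g * (p_i - q_i) for g, p_i, q_i in zip(grad_F(q), p, q))
    return F(p) - F(q) - inner

def neg_entropy(p):
    """F(p) = sum_i p_i ln p_i."""
    return sum(p_i * math.log(p_i) for p_i in p)

def neg_entropy_grad(p):
    """Gradient of F: dF/dp_i = ln p_i + 1."""
    return [math.log(p_i) + 1.0 for p_i in p]

def unnormalized_relative_entropy(p, q):
    """D_U(p || q) = sum_i (p_i ln(p_i / q_i) + q_i - p_i)."""
    return sum(p_i * math.log(p_i / q_i) + q_i - p_i for p_i, q_i in zip(p, q))

# Hypothetical positive vectors (they need not sum to one).
p = [0.2, 0.5, 0.9]
q = [0.4, 0.1, 0.7]
d_generic = bregman(neg_entropy, neg_entropy_grad, p, q)
d_closed = unnormalized_relative_entropy(p, q)
d_self = bregman(neg_entropy, neg_entropy_grad, p, p)
```

The two values agree up to rounding, and the distance of a point to
itself is zero, as expected for a strictly convex $F$.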



The optimization problem we consider is the following. Let $A$ be an
$n \times r$ matrix of linear constraints on $p \in \Delta$. Let
$q_0 \in \Delta$ be a default distribution, chosen such that
$\nabla F(q_0) = 0$. Finally, let $\tilde{p} \in \Delta$ be given; it is
considered the empirical distribution, since it typically arises from a
set of training samples that determine the linear constraints.

We now define $\mathcal{P}(A, \tilde{p})$ and $\mathcal{Q}(A, q_0)$ as

$$\mathcal{P}(A, \tilde{p}) = \{\, p \in \Delta \mid Ap = A\tilde{p} \,\}$$
$$\mathcal{Q}(A, q_0) = \{\, q \in \Delta \mid q = \mathcal{L}_F(\lambda^T A, q_0),\ \lambda \in \mathbb{R}^n \,\}$$

The following well-known theorem [12] establishes the duality between
the two natural projections of $B_F(p \,\|\, q)$ with respect to the
families $\mathcal{P}(A, \tilde{p})$ and $\mathcal{Q}(A, q_0)$.

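Theorem 1 Suppose $B_F(\tilde{p} \,\|\, q_0) < \infty$ and let
$\bar{\mathcal{Q}}(A, q_0) = \mathrm{cl}(\mathcal{Q}(A, q_0))$. Then
there exists a unique $q_* \in \Delta$ such that

1. $q_* \in \mathcal{P}(A, \tilde{p}) \cap \bar{\mathcal{Q}}(A, q_0)$

2. $B_F(p \,\|\, q) = B_F(p \,\|\, q_*) + B_F(q_* \,\|\, q)$ for any
$p \in \mathcal{P}(A, \tilde{p})$ and $q \in \bar{\mathcal{Q}}(A, q_0)$

3. $q_* = \arg\min_{q \in \bar{\mathcal{Q}}} B_F(\tilde{p} \,\|\, q)$

4. $q_* = \arg\min_{p \in \mathcal{P}} B_F(p \,\|\, q_0)$

Moreover, any of these four properties determines $q_*$ uniquely.

Note that since we have chosen $\nabla F(q_0) = 0$,
$\arg\min_{p \in \mathcal{P}} B_F(p \,\|\, q_0) = \arg\min_{p \in \mathcal{P}} F(p)$.
Property 2 is called the Pythagorean property since it resembles the
Pythagorean theorem if we imagine that $B_F(p \,\|\, q)$ is the square
of the Euclidean distance and $(p, q_*, q)$ are the vertices of a right
triangle.

4 Bregman Distance to Logistic Regression

In this section we study the minimization problem described in the
previous section. By unconstrained we mean that the parameters
$\lambda \in \mathbb{R}^n$ are free. We pose the logistic regression
problem in the Bregman distance framework developed by Collins and
Schapire [5].

The key idea is to write the function $F(p)$ as

$$F(p) = \sum_{i=1}^{m} p_i \ln p_i + (1 - p_i) \ln(1 - p_i) \qquad (6)$$

The resulting Bregman distance is

$$D_B(p \,\|\, q) = \sum_{i=1}^{m} p_i \ln\frac{p_i}{q_i} + (1 - p_i) \ln\frac{1 - p_i}{1 - q_i} \qquad (7)$$

For this choice of $F$ the Legendre transform is found to be

$$\mathcal{L}_F(v, q)_i = \frac{q_i e^{-v_i}}{1 - q_i + q_i e^{-v_i}} \qquad (8)$$

Now we define the constraint matrix $A$ as $A_{ji} = y_i h_j(x_i)$, from
which we get $v_i = (\lambda^T A)_i = \sum_{j=1}^{n} \lambda_j y_i h_j(x_i)$.

Now, if we put $q_0 = (1/2)\mathbf{1}$ into Eqn. 8 we get the logistic
probability of Eqn. 2. Also note that

$$D_B(0 \,\|\, q) = -\sum_{i=1}^{m} \ln(1 - q_i) \qquad (9)$$

which gives

$$\mathcal{L}(x, y) = \sum_{i=1}^{m} \ln\bigl(1 + e^{-y_i f_\lambda(x_i)}\bigr) = D_B\bigl(0 \,\|\, \mathcal{L}_F(\lambda^T A, q_0)\bigr) \qquad (10)$$

where $f_\lambda(x_i) = \sum_{j=1}^{n} \lambda_j h_j(x_i)$.

Finally, we can write the equivalent optimization problem as

$$\min_{q \in \bar{\mathcal{Q}}} D_B(0 \,\|\, q) \quad \text{s.t.} \quad Aq = 0 \qquad (11)$$

where as before $\bar{\mathcal{Q}} = \mathrm{cl}(\mathcal{Q})$, where

$$\mathcal{Q} = \Bigl\{\, q \in \Delta : q_i = \sigma\Bigl( \sum_{j=1}^{n} \lambda_j y_i h_j(x_i) \Bigr),\ \lambda \in \mathbb{R}^n \,\Bigr\}$$

where $\sigma(x) = (1 + e^{x})^{-1}$ is the sigmoid function. For our
choice of $q_0 = (1/2)\mathbf{1}$ we have
$\mathcal{L}_F(v, q_0)_i = \sigma(v_i)$, as shown in Eqn. 8. Also, since
each element of $q$ is the output of a sigmoid function,
$\Delta = [0, 1]^m$.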
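The chain from Eqn. 8 to Eqn. 10 can be verified numerically. The sketch
below (names and toy data are illustrative assumptions) applies the
Legendre transform of Eqn. 8 to $q_0 = (1/2)\mathbf{1}$, checks that it
equals $\sigma(v_i) = (1 + e^{v_i})^{-1}$, and confirms that
$D_B(0 \,\|\, q)$ from Eqn. 9 reproduces the log-loss of Eqn. 3.

```python
import math

def legendre_binary(v, q):
    """Eqn. 8: coordinate-wise Legendre transform for the binary-entropy F."""
    return [q_i * math.exp(-v_i) / (1.0 - q_i + q_i * math.exp(-v_i))
            for v_i, q_i in zip(v, q)]

def d_b_zero(q):
    """Eqn. 9: D_B(0 || q) = -sum_i ln(1 - q_i)."""
    return -sum(math.log(1.0 - q_i) for q_i in q)

def log_loss(lam, H, y):
    """Eqn. 3 with f_lambda(x_i) = sum_j lam[j] * H[i][j]."""
    return sum(math.log1p(math.exp(-y_i * sum(l * h for l, h in zip(lam, h_i))))
               for h_i, y_i in zip(H, y))

# Hypothetical toy data: H[i][j] = h_j(x_i), labels in {-1, +1}.
H = [[0.5, 0.2], [0.4, -0.3], [-0.5, 0.1], [-0.3, -0.4]]
y = [+1, +1, -1, -1]
lam = [0.7, -1.3]

# v_i = (lam^T A)_i with A[j][i] = y_i * h_j(x_i).
v = [y_i * sum(l * h for l, h in zip(lam, h_i)) for h_i, y_i in zip(H, y)]
q = legendre_binary(v, [0.5] * len(v))
sigmoid_values = [1.0 / (1.0 + math.exp(v_i)) for v_i in v]
distance = d_b_zero(q)
loss = log_loss(lam, H, y)
```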

The key points to note in this derivation are

a. $\tilde{p} = 0$

b. $\lambda \in \mathbb{R}^n$

The implication of point (a) is that the constraints are homogeneous.
This is a strong assumption on the constraints. It turns out that we can
relax it only if we place additional constraints on the free parameter
$\lambda$. This points to a regularized scheme, in which the first
condition is relaxed at the cost of additional constraints on the
second. We redefine the set $\mathcal{Q}$ as

$$\mathcal{Q} = \Bigl\{\, q \in \Delta : q_i = \sigma\Bigl( \sum_{j=1}^{n} \lambda_j y_i h_j(x_i) \Bigr),\ \lambda \in \mathbb{R}^n,\ \|\lambda\|_1 \le c \,\Bigr\}$$

We consider supervised learning in settings where there are many input
features, but where a small subset of the features is sufficient to
approximate the target concept well. In supervised learning settings
with many input features, over-fitting is usually a potential problem
unless there is ample training data. For example, it is well known that
for unregularized discriminative models fit via training error
minimization, sample complexity (i.e., the number of training examples
needed to learn "well") grows linearly with the VC dimension [14].
Further, the VC dimension for most models grows about linearly in the
number of parameters [13], which typically grows at least linearly in
the number of input features. Thus, unless the training set size is
large relative to the dimension of the input, some special mechanism,
such as regularization, which encourages the fitted parameters to be
small, is usually needed to prevent over-fitting.

Once we have defined our optimization problem, our aim is to find a
sequence $q_k = \mathcal{L}_F(\lambda_k^T A, q_0)$ that minimizes our
cost function while remaining feasible with respect to the additional
regularization constraint $\|\lambda\|_1 \le c$.

5 Auxiliary Function 

The idea of auxiliary functions was proposed by Della Pietra et al.
[12]. The idea is analogous to the EM algorithm: it bounds the change in
the objective between two successive iterations. Since we are dealing
with distances, which are non-negative, strict descent means that
$|d_{t+1} - d_t| = -(d_{t+1} - d_t)$, and this decrease can be driven
iteratively until convergence is achieved.

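Definition 2 Let $A$ be a linear constraint matrix and
$\lambda \in \mathbb{R}^n$. A function
$\mathcal{A} : \mathbb{R}^n \times \Delta \to \mathbb{R}$ is an
auxiliary function for $L(q) = -B_F(\tilde{p} \,\|\, q)$ if

1. For all $q \in \Delta$ and $\lambda \in \mathbb{R}^n$,

$$L(\mathcal{L}_F(\lambda^T A, q)) \ge L(q) + \mathcal{A}(\lambda, q)$$

2. $\mathcal{A}(\lambda, q)$ is continuous in $q \in \Delta$ and $C^1$
in $\lambda \in \mathbb{R}^n$, with $\mathcal{A}(0, q) = 0$ and

$$\frac{d}{dt}\Big|_{t=0} \mathcal{A}(t\lambda, q) = \frac{d}{dt}\Big|_{t=0} L(\mathcal{L}_F((t\lambda)^T A, q))$$

3. If $\lambda = 0$ is a minimum of $\mathcal{A}(\lambda, q)$, then
$Aq = A\tilde{p}$.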

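Theorem 2 Suppose $q^k$ is any sequence in $\Delta$ with $q^0 = q_0$ and
$q^{k+1} = \mathcal{L}_F(\lambda_k^T A, q^k)$, where
$\lambda_k \in \mathbb{R}^n$ satisfies

$$\mathcal{A}(\lambda_k, q^k) = \sup_{\lambda} \mathcal{A}(\lambda, q^k)$$

Then $L(q^k)$ increases monotonically to
$\max_{q \in \bar{\mathcal{Q}}} L(q)$ and $q^k$ converges to the
distribution $q_* = \arg\max_{q \in \bar{\mathcal{Q}}} L(q)$.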

The proof of this theorem is given in Della Pietra et al. [12]. We state
the three lemmas on which the proof is based; once they are established,
the theorem follows directly. The three lemmas are

1. If $m \in \Delta$ is a cluster point of $q^{(k)}$, then
$\mathcal{A}(\lambda, m) \le 0$ for all $\lambda \in \mathbb{R}^n$.

2. If $m \in \Delta$ is a cluster point of $q^{(k)}$, then
$\frac{d}{dt}\big|_{t=0} L(\mathcal{L}_F((t\lambda)^T A, m)) = 0$ for
all $\lambda \in \mathbb{R}^n$.

3. Suppose $\{q^{(k)}\}$ is any sequence with only one cluster point
$q_*$. Then $q^{(k)}$ converges to $q_*$.

6 Constrained Bregman Distance Minimization

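Having shown the analogy between logistic regression and Bregman
distances, we can proceed to find a suitable auxiliary function for our
problem. One key observation is that we can write $q_{k+1}$ as a simple
function of $q_k$ as follows:

$$q_{k+1} = \mathcal{L}_F((\lambda_k + \delta_k)^T A, q_0) = \mathcal{L}_F(\delta_k^T A, \mathcal{L}_F(\lambda_k^T A, q_0)) = \mathcal{L}_F(\delta_k^T A, q_k)$$

Let us denote $v = \delta_k^T A$, so that
$q_{k+1} = \mathcal{L}_F(v, q_k)$. Now, from Eqn. 9, we can write

$$D_B(0 \,\|\, q_{k+1}) - D_B(0 \,\|\, q_k) = \sum_{i=1}^{m} \ln\bigl(1 - q_i + q_i e^{-v_i}\bigr) \le \sum_{i=1}^{m} q_i \bigl(e^{-v_i} - 1\bigr)$$

where $q_i$ denotes the $i$th component of $q_k$ and the inequality
follows from $\ln(1 + x) \le x$.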

Substituting $(\delta^T A)_i = v_i$, we define our auxiliary function as

$$\mathcal{A}(\delta, q) = \sum_{i=1}^{m} q_i \bigl(e^{-(\delta^T A)_i} - 1\bigr) \qquad (12)$$



It can be easily verified that the above choice of auxiliary function
satisfies the conditions of Definition 2. Now we need to find a sequence
$\{\delta^k\} \to 0$ for which $\mathcal{A}(\delta, q) \le 0$ and
$\mathcal{A}(\delta, q) \to 0$ monotonically.
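The bound behind the auxiliary function of Eqn. 12, and the group
property of Lemma 1 that the derivation relies on, can both be checked
numerically. This is a minimal sketch with randomly drawn, hypothetical
values for $q$, $v$ and $w$; the function names are assumptions.

```python
import math
import random

def legendre_binary(v, q):
    """Eqn. 8: Legendre transform for the binary-entropy F."""
    return [q_i * math.exp(-v_i) / (1.0 - q_i + q_i * math.exp(-v_i))
            for v_i, q_i in zip(v, q)]

def d_b_zero(q):
    """Eqn. 9: D_B(0 || q) = -sum_i ln(1 - q_i)."""
    return -sum(math.log(1.0 - q_i) for q_i in q)

def auxiliary(v, q):
    """Eqn. 12 written in terms of v_i = (delta^T A)_i."""
    return sum(q_i * (math.exp(-v_i) - 1.0) for v_i, q_i in zip(v, q))

random.seed(0)
q = [random.uniform(0.05, 0.95) for _ in range(5)]
v = [random.uniform(-2.0, 2.0) for _ in range(5)]
w = [random.uniform(-2.0, 2.0) for _ in range(5)]

# Change in the distance after one step, and its upper bound.
change = d_b_zero(legendre_binary(v, q)) - d_b_zero(q)
bound = auxiliary(v, q)

# Lemma 1: applying w and then v equals applying v + w once.
two_steps = legendre_binary(v, legendre_binary(w, q))
one_step = legendre_binary([a + b for a, b in zip(v, w)], q)
```

Because the bound holds for every step, any $\delta$ that makes the
auxiliary function negative is guaranteed to decrease the distance.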

7 Algorithm 

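Assumptions: $F : \Delta \to \mathbb{R}$ is such that the set
$\{q \in \Delta : B_F(0 \,\|\, q) \le c\}$ is bounded for any
$c < \infty$.

Parameters: $\Delta = [0, 1]^m$, $F$ satisfying the assumptions above,
and $q_0 = (1/2)\mathbf{1}$.

Input: Constraint matrix $A \in [-1, 1]^{n \times m}$, where
$A_{ji} = y_i h_j(x_i)$ and $\sum_{j} |A_{ji}| \le 1$.

Output: Denote $\mathcal{L}_F(\lambda_k^T A, q_0)$ as
$\mathcal{L}_F^k$. Generate a sequence $\lambda_1, \lambda_2, \ldots$
such that

$$\lim_{k \to \infty} B_F(0 \,\|\, \mathcal{L}_F^k) = \min_{\lambda} B_F\bigl(0 \,\|\, \mathcal{L}_F(\lambda^T A, q_0)\bigr) \quad \text{subject to} \quad \|\lambda\|_1 \le u$$

Let $\lambda_1 = 0$
For $k = 1, 2, \ldots$
    $\delta_k = \arg\min_{\delta} \sum_{i=1}^{m} q_i^k \bigl(e^{-(\delta^T A)_i} - 1\bigr)$
        s.t. $\|\lambda_k + \delta\|_1 \le u$, where $q^k = \mathcal{L}_F^k$
    Update $\lambda_{k+1} = \lambda_k + \delta_k$
End For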
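The loop structure above can be illustrated with a simplified stand-in
for the constrained inner step. The sketch below does not implement the
primal-dual solver of this paper: it takes the closed-form unconstrained
minimizer of the decoupled bound,
$\delta_j = \tfrac{1}{2}\ln(W_j^+/W_j^-)$, and shrinks the step by
halving until $\|\lambda + \delta\|_1 \le u$. The function name, the
smoothing constant `eps`, the halving rule and the toy matrix are all
assumptions made for illustration.

```python
import math

def l1_constrained_fit(A, u, iters=200, eps=1e-8):
    """A[j][i] = y_i * h_j(x_i), with sum_j |A[j][i]| <= 1 for every i."""
    n, m = len(A), len(A[0])
    lam = [0.0] * n                  # lambda_1 = 0, i.e. q = (1/2) * 1
    history = []                     # loss D_B(0 || q^k) at each iterate
    for _ in range(iters):
        v = [sum(lam[j] * A[j][i] for j in range(n)) for i in range(m)]
        q = [1.0 / (1.0 + math.exp(v_i)) for v_i in v]   # q = L_F(lam^T A, q0)
        history.append(sum(math.log1p(math.exp(-v_i)) for v_i in v))
        delta = []
        for j in range(n):
            w_pos = sum(q[i] * A[j][i] for i in range(m) if A[j][i] > 0)
            w_neg = sum(-q[i] * A[j][i] for i in range(m) if A[j][i] < 0)
            # Smoothed unconstrained minimizer of the decoupled bound.
            delta.append(0.5 * math.log((w_pos + eps) / (w_neg + eps)))
        # Halve the step until lam + theta * delta lies in the L1 ball.
        theta = 1.0
        while theta > 1e-12 and sum(abs(l + theta * d)
                                    for l, d in zip(lam, delta)) > u:
            theta *= 0.5
        candidate = [l + theta * d for l, d in zip(lam, delta)]
        if sum(abs(c) for c in candidate) > u:
            break                    # no feasible step left along delta
        lam = candidate
    return lam, history

# Hypothetical constraint matrix: 2 features, 4 examples.
A = [[0.5, 0.4, 0.5, 0.3],
     [0.2, -0.3, -0.1, 0.4]]
lam, history = l1_constrained_fit(A, u=2.0)
```

Each term of the decoupled bound is convex in $\delta_j$ and zero at
$\delta_j = 0$, so any shrunken step keeps the bound non-positive and the
recorded loss does not increase, while $\lambda$ stays inside the L1
ball.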

8 A Primal-Dual method for L1 regularized Logistic Regression

The basic algorithm for the unconstrained case was proposed by [5], but
their method finds a lower bound using the first-order characteristics
of the unconstrained minimizer. In our case we want to find the
constrained minimizer of the auxiliary function. Since we also require
$\mathcal{A}(\delta, q) \le 0$, the new problem is

$$\arg\min_{\delta \in \mathbb{R}^n} \sum_{i=1}^{m} q_i \bigl(e^{-(\delta^T A)_i} - 1\bigr) \qquad (13)$$
$$\text{s.t.} \quad \|\lambda + \delta\|_1 \le u, \quad \mathcal{A}(\delta, q) \le 0$$



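Analyzing the cost function more closely, we find that each term can be
bounded as

$$e^{-(\delta^T A)_i} - 1 \le \sum_{j=1}^{n} |A_{ji}| \bigl(e^{-s_{ji}\delta_j} - 1\bigr)$$

where $s_{ji} = \mathrm{sign}(A_{ji})$; the bound follows from the
convexity of the exponential together with the input condition
$\sum_j |A_{ji}| \le 1$. Absorbing this bound into the cost function we
get

$$\arg\min_{\delta} \sum_{i=1}^{m} \sum_{j=1}^{n} q_i |A_{ji}| \bigl(e^{-s_{ji}\delta_j} - 1\bigr) \qquad (14)$$
$$\text{s.t.} \quad \|\lambda + \delta\|_1 \le u, \quad \mathcal{A}(\delta, q) \le 0$$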

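Now we define the two quantities

$$W_j^+(q) = \sum_{i \,:\, \mathrm{sign}(A_{ji}) = +1} q_i |A_{ji}|, \qquad W_j^-(q) = \sum_{i \,:\, \mathrm{sign}(A_{ji}) = -1} q_i |A_{ji}|$$

such that at iteration $k$ we have $W_j^+(q_k)$ and $W_j^-(q_k)$. Then
we can rewrite the optimization problem as

$$\arg\min_{\delta} \sum_{j=1}^{n} W_j^+(q_k)\bigl(e^{-\delta_j} - 1\bigr) + W_j^-(q_k)\bigl(e^{\delta_j} - 1\bigr) \qquad (15)$$
$$\text{s.t.} \quad \|\lambda + \delta\|_1 \le u, \quad \mathcal{A}(\delta, q) \le 0$$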
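The step from Eqn. 13 to Eqn. 15 replaces the coupled auxiliary function
by a sum of one-dimensional terms. The following sketch (all names and
numbers are illustrative assumptions) computes $W_j^+$ and $W_j^-$,
checks that the decoupled objective upper-bounds the auxiliary function
when $\sum_j |A_{ji}| \le 1$, and checks the unconstrained per-coordinate
minimizer $\delta_j = \tfrac{1}{2}\ln(W_j^+/W_j^-)$.

```python
import math

def auxiliary(delta, q, A):
    """Eqn. 12: sum_i q_i (exp(-(delta^T A)_i) - 1), with A[j][i]."""
    n, m = len(A), len(A[0])
    return sum(q[i] * (math.exp(-sum(delta[j] * A[j][i] for j in range(n))) - 1.0)
               for i in range(m))

def weights(q, A):
    """W_j^+ and W_j^-: q-weighted sums of |A_ji| split by the sign of A_ji."""
    w_pos = [sum(q[i] * a for i, a in enumerate(row) if a > 0) for row in A]
    w_neg = [sum(-q[i] * a for i, a in enumerate(row) if a < 0) for row in A]
    return w_pos, w_neg

def coordinate_term(d, wp, wn):
    """One summand of Eqn. 15."""
    return wp * (math.exp(-d) - 1.0) + wn * (math.exp(d) - 1.0)

def decoupled_bound(delta, w_pos, w_neg):
    return sum(coordinate_term(d, wp, wn)
               for d, wp, wn in zip(delta, w_pos, w_neg))

# Hypothetical data: every column of A has absolute sum at most 1.
A = [[0.5, 0.4, 0.5, 0.3],
     [0.2, -0.3, -0.1, 0.4]]
q = [0.3, 0.6, 0.5, 0.2]
delta = [0.4, -0.7]

w_pos, w_neg = weights(q, A)
aux_value = auxiliary(delta, q, A)
bound_value = decoupled_bound(delta, w_pos, w_neg)

# Unconstrained minimizer for coordinate j = 1 (both weights nonzero).
d_star = 0.5 * math.log(w_pos[1] / w_neg[1])
```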



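Adopting the approach of [6], we can now introduce slack variables and
write the penalty function as

$$\arg\min \sum_{j=1}^{n} g(\delta_j) + \alpha (s_j + t_j) \quad \text{s.t.} \quad \lambda_j + \delta_j + s_j - t_j = u_j \qquad (16)$$

where $g(\delta_j) = W_j^+(q_k)\bigl(e^{-\delta_j} - 1\bigr) + W_j^-(q_k)\bigl(e^{\delta_j} - 1\bigr)$
and $j \in \{1, \ldots, n\}$.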

Finally, introducing the log barrier function and absorbing the two
terms $\lambda_j$ and $u_j$ into one term $c_j = u_j - \lambda_j$, we
get

$$\arg\min \sum_{j=1}^{n} g(\delta_j) + \alpha (s_j + t_j) - \mu\, \varphi(s_j, t_j, r_j) \quad \text{s.t.} \quad \delta_j + s_j - t_j = c_j \qquad (17)$$

where $\varphi(s_j, t_j, r_j) = \log s_j + \log t_j + \log r_j$ and
$\mu$ is the barrier parameter. As proposed in [ISl, we decompose the
problem into a master problem and a sequence of sub-problems. We solve
the master problem for a sequence of barrier parameters $\{\mu_k\}$ such
that $\lim_{k \to \infty} \mu_k = 0^+$, where the + sign denotes
convergence to 0 from the positive side.



The sub-problems are exactly the same as Eqn. 17 except that the value
of $c$ is held constant while solving them. The $j$th sub-problem can
now be written as

$$\arg\min \; g(\delta) + \alpha (s + t) - \mu\, \varphi(s, t, r) \qquad (18)$$
$$\text{s.t.} \quad \delta + s - t = c, \quad g(\delta) + r = 0$$
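To illustrate how a log barrier drives a one-dimensional sub-problem
toward its constrained solution as $\mu \to 0^+$, the sketch below uses a
simplified variant rather than the slack formulation of Eqn. 18: it
minimizes $g(\delta)$ over a box $lo < \delta < hi$ (a stand-in for
$|\lambda_j + \delta| \le u_j$) and finds each barrier minimizer by
bisection on the derivative instead of a Newton step. The weights, the
bounds and the schedule of $\mu$ values are hypothetical.

```python
import math

def g_prime(d, w_pos, w_neg):
    """Derivative of g(d) = w_pos (e^{-d} - 1) + w_neg (e^{d} - 1)."""
    return -w_pos * math.exp(-d) + w_neg * math.exp(d)

def barrier_minimizer(w_pos, w_neg, lo, hi, mu, tol=1e-12):
    """Minimize g(d) - mu * (log(hi - d) + log(d - lo)) over lo < d < hi.

    The barrier objective is strictly convex on the open interval, so its
    derivative is increasing and bisection on the sign change suffices.
    """
    def dphi(d):
        return g_prime(d, w_pos, w_neg) + mu / (hi - d) - mu / (d - lo)
    a, b = lo, hi
    while b - a > tol:
        mid = 0.5 * (a + b)
        if dphi(mid) > 0:
            b = mid
        else:
            a = mid
    return 0.5 * (a + b)

# Hypothetical sub-problem: lambda_j = 0.5, u_j = 1.0.
w_pos, w_neg = 0.6, 0.1
lam_j, u_j = 0.5, 1.0
lo, hi = -u_j - lam_j, u_j - lam_j

# Constrained solution: the unconstrained minimizer clipped to the box.
d_unconstrained = 0.5 * math.log(w_pos / w_neg)
d_clipped = min(max(d_unconstrained, lo), hi)

# Follow the barrier path as mu decreases toward zero.
path = [barrier_minimizer(w_pos, w_neg, lo, hi, mu)
        for mu in (1.0, 1e-2, 1e-4, 1e-6, 1e-8)]
```

In this example the unconstrained minimizer lies outside the box, so the
barrier solutions approach the active bound from the interior as $\mu$
shrinks.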



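Proceeding as shown in Convex Optimization [3], Eqn. 11.53, the modified
KKT conditions can be expressed as $r_t(x, \lambda, \nu) = 0$ (where
$(\lambda, \nu)$ are the multipliers, redefined here for consistency
with the notation of [3], and the subscript $t$ is the barrier scaling
parameter of [3], not the slack variable $t$), where we define

$$r_t(x, \lambda, \nu) = \begin{bmatrix} \nabla f_0(x) + J(x)^T \lambda + A^T \nu \\ -\lambda f(x) - 1/t \\ Ax - b \end{bmatrix} \qquad (19)$$

where

$$x = [\, \delta,\ r,\ s,\ t \,]^T$$
$$f_0(x) = g(\delta) + \alpha (s + t) - \mu\, \varphi(s, t, r)$$
$$f(x) = g(\delta) + r$$
$$J(x) = [\, \nabla g(\delta),\ 1,\ 0,\ 0 \,]$$
$$A = [\, 1,\ 0,\ 1,\ -1 \,]$$
$$b = c$$

The Newton step can now be formulated as

$$\begin{bmatrix} \nabla^2 f_0(x) + \lambda \nabla^2 f(x) & J(x)^T & A^T \\ -\lambda J(x) & -f(x) & 0 \\ A & 0 & 0 \end{bmatrix} \begin{bmatrix} \Delta x \\ \Delta \lambda \\ \Delta \nu \end{bmatrix} = - \begin{bmatrix} r_{dual} \\ r_{cent} \\ r_{pri} \end{bmatrix} \qquad (20)$$

where

$$\begin{bmatrix} r_{dual} \\ r_{cent} \\ r_{pri} \end{bmatrix} = r_t(x, \lambda, \nu)$$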



9 Experiments and Results 

In this section we report results of the experiments conducted with the
model proposed in this paper. The sparsity introduced by L1
regularization is captured by tests on randomly generated data. The
loss-minimization curves remain similar to the unconstrained case, since
the per-coordinate sub-problems of Eqn. 17 are convex. However, the
sparsity of the solution allows redundant features to be dropped, which
speeds up the iterations.

In our experiments, we generated random data and classified it using a
very noisy hyperplane. We investigate only 2-class classification
problems in this work. We consider medium- to high-dimensional problems,
with dimensionality ranging from 20 to 500. We tested two scenarios: (a)
the number of training points is of the order of the feature dimension,
and (b) the number of training points differs from the feature dimension
by more than an order of magnitude. In every case the random data is
first classified by a random hyperplane, and we then add Gaussian noise
to the data dimensions based on a coin flip. The noise is assumed to be
$\epsilon \sim \mathcal{N}(0, \sigma I)$, where $\sigma < 1$. The key
point of interest is that the procedure described in this work decouples
the features, so a feature is dropped from the optimization scheme when
the change $\|\delta_i\|$ drops below some threshold. One such
comparative plot is shown in Fig. 2 (left). The sparsity of the features
is shown in Fig. 2 (right).
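One possible reading of this data-generation procedure is sketched
below. The sampling distributions, the default `sigma`, the coin-flip
probability and the function name are assumptions chosen for
illustration; the experiments reported here may have used different
settings.

```python
import random

def make_noisy_data(m, n, sigma=0.5, flip_prob=0.5, seed=0):
    """Generate m points in n dimensions labeled by a random hyperplane.

    Labels come from the clean point; afterwards each coordinate receives
    Gaussian noise N(0, sigma) when a coin flip succeeds.
    """
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 1.0) for _ in range(n)]          # random hyperplane
    X, y = [], []
    for _ in range(m):
        x = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        score = sum(w_j * x_j for w_j, x_j in zip(w, x))
        y.append(1 if score >= 0 else -1)
        X.append([x_j + rng.gauss(0.0, sigma) if rng.random() < flip_prob else x_j
                  for x_j in x])
    return X, y, w

# Scenario (a): number of points of the order of the feature dimension.
X, y, w = make_noisy_data(m=50, n=20)
```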

For comparison with other algorithms we run the logistic classifier on
public-domain data, namely the Wisconsin Diagnostic Breast Cancer (WDBC)
data set and the Musk database (Clean 1 and 2) COI. The WDBC data has
569 instances with 30 real-valued features. There are 357 benign
(positive) instances and 212 malignant (negative) instances. The best
reported result is 97.5%, using decision trees constructed by linear
programming f9', T\. Our method generates 16 false negatives and 23
false positives, totaling 39 errors, for an accuracy of 93.15%. The
training and testing errors are shown in Fig. 3 (left).

The Musk Clean 1 data set describes a set of 92 molecules, of which 47
are judged by human experts to be musks and the remaining 45 are judged
to be non-musks. Similarly, the Musk Clean 2 database describes a set of
102 molecules, of which 39 are musks and the remaining 63 are non-musks.
The 166 features that describe these molecules depend upon the exact
shape, or conformation, of the molecule. Multiple conformations were
generated for each molecule, which after pruning amount to 476
conformations for the Clean 1 data set and 6598 for the Clean 2 data
set. The many-to-one relationship between feature vectors and molecules
is called the "multiple instance problem". When learning a classifier
for this data, the classifier should classify a molecule as "musk" if
ANY of its conformations is classified as a musk. A molecule should be
classified as "non-musk" if NONE of its conformations is classified as a
musk.

Figure 2: Left: Test error, regularized (blue) and unconstrained (red),
for 500 dimensions. Right: Dropped features as a percentage of the total
features.
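The ANY/NONE rule for the multiple instance problem amounts to a logical
OR over the per-conformation predictions. A minimal sketch follows; the
molecule identifiers and predictions are hypothetical.

```python
def classify_molecules(conformation_preds):
    """Map molecule id -> True ("musk") if any conformation is predicted musk.

    conformation_preds maps each molecule id to a list of booleans, one
    per conformation (True means the conformation was classified as musk).
    """
    return {mol: any(preds) for mol, preds in conformation_preds.items()}

# Hypothetical per-conformation predictions.
preds = {
    "mol-a": [False, True, False],   # one musk conformation -> musk
    "mol-b": [False, False],         # none -> non-musk
}
labels = classify_molecules(preds)
```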

We report results for tests conducted on the two databases. The training
and test plots for the Clean 2 data are shown in Fig. 3 (right). We
compare our method, L1 Logistic Regression based on Bregman Distances
(L1LRB), against published results, and our method outperforms most of
them. The comparative results are shown in Table 1 and Table 2. Also
note that the poor performance of the C4.5 algorithm has been attributed
to the fact that it does not take the multi-instance nature of the
problem into consideration during training. We did not take it into
consideration during training either, and our method still ranks in the
top two among all the reported results. The details of the other methods
are discussed in IS).



Algorithm           TP   FN   FP   TN   % Acc
L1LRB               45    2    2   43    95.6
Iter-discrim APR    42    5    2   43    92.4
GFS-Elim-kde APR    46    1    7   38    91.3
All-pos APR         36   11    7   38    80.4
Back-prop           45    2   21   24    75.0
C4.5 (pruned)       45    2   24   21    68.5

Table 1: Comparative results for the Musk Clean 1 database.
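The accuracy column is the fraction of correctly classified molecules,
$(TP + TN) / (TP + FN + FP + TN)$. The short sketch below (the function
name is an assumption) recomputes it for two rows of Table 1 and for the
WDBC error count quoted earlier.

```python
def accuracy(tp, fn, fp, tn):
    """Percentage of correct predictions from a confusion matrix."""
    return 100.0 * (tp + tn) / (tp + fn + fp + tn)

# Rows taken from Table 1 (Musk Clean 1, 92 molecules).
acc_l1lrb = accuracy(tp=45, fn=2, fp=2, tn=43)
acc_iter_discrim = accuracy(tp=42, fn=5, fp=2, tn=43)

# WDBC: 39 errors out of 569 instances.
acc_wdbc = 100.0 * (569 - 39) / 569
```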



Algorithm           TP   FN   FP   TN   % Acc
Iter-discrim APR    30    9    2   61    89.2
L1LRB               30    9    6   57    85.29
GFS-Elim-kde APR    32    7   13   50    80.4
GFS-El-count APR    31    8   17   46    75.5
All-pos APR         34    5   23   40    72.6
Back-prop           16   23   10   53    67.7
GFS-All-Pos APR     37    2   32   31    66.7
Most Freq Class      0   39    0   63    61.8
C4.5 (pruned)       32    7   35   28    58.8

Table 2: Comparative results for the Musk Clean 2 database.

10 Conclusion and extensions

We posed the problem of L1-regularized logistic regression as a
constrained Bregman distance minimization problem and formulated the
optimization as a primal-dual problem that decouples across the
dimensions of the parameter vector. The optimization technique described
in this work relies on the strict feasibility properties of primal-dual
methods, which guarantee the convergence of the algorithm. Comparative
results on published data sets demonstrate the strength of the
regularized method.